CINTIL DependencyBank PREMIUM - A Corpus of Grammatical Dependencies for Portuguese
نویسندگان
چکیده
This paper presents a new linguistic resource for the study and computational processing of Portuguese. CINTIL DependencyBank PREMIUM is a corpus of Portuguese news text, accurately manually annotated with a wide range of linguistic information (morpho-syntax, named-entities, syntactic function and semantic roles), making it an invaluable resource specially for the development and evaluation of data-driven natural language processing tools. The corpus is under active development, reaching 4,000 sentences in its current version. The paper also reports on the training and evaluation of a dependency parser over this corpus. CINTIL DependencyBank PREMIUM is freely-available for research purposes through META-SHARE.
منابع مشابه
A PropBank for Portuguese: the CINTIL-PropBank
With the CINTIL-International Corpus of Portuguese, an ongoing corpus annotated with fully flegded grammatical representation, sentences get not only a high level of lexical, morphological and syntactic annotation but also a semantic analysis that prepares the data to a manual specification step and thus opens the way for a number of tools and resources for which there is a great research focus...
متن کاملTreebanking by Sentence and Tree Transformation: Building a Treebank to support Question Answering in Portuguese
Abstract This paper presents CINTIL-QATreebank, a treebank composed of Portuguese sentences that can be used to support the development of Question Answering systems. To create this treebank, we use declarative sentences from the pre-existing CINTIL-Treebank and manually transform their syntactic structure into a non-declarative sentence. Our corpus includes two clause types: interrogative and ...
متن کاملDeveloping a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank
Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools an...
متن کاملComplex Predicates Annotation in a Corpus of Portuguese
We present an annotation scheme for the annotation of complex predicates, understood as constructions with more than one lexical unit, each contributing part of the information normally associated with a single predicate. We discuss our annotation guidelines of four types of complex predicates, and the treatment of several difficult cases, related to ambiguity, overlap and coordination. We then...
متن کاملEighth International Conference on Language Resources and Evaluation
Evaluating a taxonomy learned automatically against an existing gold standard is a very complex problem, because differences stem from the number, label, depth and ordering of the taxonomy nodes. In this paper we propose casting the problem as one of comparing two hierarchical clusters. To this end we defined a variation of the Fowlkes and Mallows measure (Fowlkes and Mallows, 1983). Our method...
متن کامل